17. Parsers

Parsers

In the following lessons you will learn how to use the BeautifulSoup library to pull data out of HTML and XML files. BeautifulSoup uses a parser to transform files into a tree of Python objects that can be easily searched. So, before we start learning how to use BeautifulSoup, let's take a quick look at parsers.

In BeautifulSoup, the parser is a piece of software whose primary job is to build a data structure in the form of a hierarchical tree that gives a structural representation of the HTML or XML file. In other words, the parser divides these complex files into simpler parts while keeping track of how these parts are related to each other. BeautifulSoup supports a number of parsers, but throughout these lessons we will only be using the lxml parser. The lxml parser can be used to parse both HTML and XML files and has the advantage of being very fast. In order to use the lxml parser, you must have lxml installed. You can install the lxml parser by using the following command in your terminal:

$ pip install lxml

If you're working with perfectly formatted HTML or XML files ( i.e. files that don't contain any missing information or mistakes) then, in the majority of cases, your choice of parser shouldn't really matter. However, if the files you are working with have missing information or mistakes, then your choice of parser will matter because each parser has different rules for dealing with missing information or mistakes. Consequently, in these cases, different parsers will create different parse trees for the same document. You can take a look at the differences between parsers , in the BeautifulSoup documentation, for details.